CLAIMS 

1. A computer-based method for automatically finding and exploiting hidden, fuzzy 
algebraic constraints in a database, said method comprising the steps of: 

(a) constructing one or more candidates of form C=(#/, #2, P, ©), wherein aj and 
5 Q2 are numerical attributes associated with column values of data in said database, P is a 

pairing rule, and © is any of the following algebraic operators: +, -, x, or /; 

(b) constructing, for each candidate identified in (a), an algebraic constraint 
AC=(tf/, #2, P 9 ©, //,..., Ik) by applying any of, or a combination of the following 
techniques to a sample of column values: statistical histogramming, segmentation, or 

10 clustering, where //, ...,/* is a set of disjoint intervals and k > 7, and 

wherein said constructed algebraic constraints are used in query optimization. 

2. A compute-based method as per claim 1, wherein one or more pruning rules are 
used to limit said number of constructed candidates. 

15 

3. A computer-based method as per claim 2, wherein said pairing rule P represents 
either a trivial pairing rule or or a join between tables R and S and said pruning rules 
comprise any of, or a combination of the following: 

pairing rule P is of form R.a = S.b or of the form o R , and the number of rows in 
20 either table R or table S lies below a specified threshold value; 
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pairing rule P is of form R.a = S.b with a e K and the number of distinct values 
in S.b divided by the number of values in R. a lies below a specified threshold value, 
wherein K is a set comprising key-like columns among all columns in said database; 

pairing rule P is of form R.a - S.b, and one or both of R and S fails to have an 
index on any of its columns; or 

pairing rule P is of form R.a = S.b with aeK , and S.b is a system-generated key. 

4. A computer-based method as per claim 1, wherein said method further comprises 
the steps of: 

identifying a set of useful algebraic constraints via one or more pruning rules; and 
partitioning data into compliant data and exception data. 

5. A computer-based method as per claim 4, wherein said method farther comprises 
the steps of: 

receiving a query; 

modifying said query to incorporate identified constraints; and 
combining results of modified query executed on data in said database and said 
original query executed on exception data. 
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6. A computer-based method as per claim 4, wherein said partitioning is done by 
incrementally maintained materialized views, partial indices, or physical partitioning of 
the table. 

5 7. A computer-based method as per claim 2, wherein said pruning rules comprise 
any of, or a combination of the following: 

a j and ci2 are not comparable data types; 

the fraction of NULL values in either aj or ar? exceeds a specified threshold; or 
either column aj or ar? is not indexed. 

10 

8. A computer-based method as per claim 1, wherein said step of constructing one or 
more candidates further comprises the steps of: 

generating a set P of pairing rules; and 

for each pairing rule PeP, systematically considering possible attribute pairs (aj, 
15 02) and operators © with which to construct candidates. 

9. A computer-based method as per claim 8, wherein said step of generating a set P 
of pairing rules further comprises the steps of: 

initializing P to be an empty set; 

20 adding a trivial pairing rule of the form © R to said set P for each table R in said 

database; and 
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generating and adding nontrivial pairing rules to said set P based upon identifying 

matching columns via an inclusion dependency, wherein a column b is considered a 
match for column a if: 

data in columns a and b are of a comparable type; or 
5 either (i) column a is a declared primary key and column b is a declared 

foreign key for the primary key, or (ii) every data value in a sample from column b 
has a matching value in column a. 

10. A computer-based method as per claim 8, wherein said step of generating a set P of 

10 pairing rules further comprises the steps of: 
initializing P to be an empty set; 

adding a trivial pairing rule of the form or to said set P for each table R in said 
database; and 

generating a set K of key-like columns from among all columns in said database 
15 with each column in set K belonging to a predefined set of types T, said set K comprising 
declared primary key columns, declared unique key columns, and undeclared key 
columns, wherein said primary keys or declared unique keys are compound keys of form 
a = (a h ....a^eT" for m>l\ 

adding nontrivial pairing rules to said set P based upon identifying matching 

20 compound columns via an inclusion dependency wherein, given a compound key 
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(ai,... t a„JeK, a compound column b is considered a component wise match for 
compound column a if: 

data in compound columns a and b are of a comparable type; or 
either (i) compound column a is a declared primary key and compound 
5 column b is a declared foreign key for the primary key, or (ii) every data value in a 

sample from compound column b has a matching value in compound column a, 

11. A computer-based method as per claim 1, wherein said step of constructing 
algebraic constraints further comprises the steps of: 
10 constructing a sample set Wc of an induced set Qc, wherein P is a join predicate 

between tables R and S and Q c = {r.a } © r.a 2 : r € R} when the pairing rule P is a trivial 

rule 0 R and Q c = {r.a x © s.a 2 \r e R,s e S.and (r,s) satisfies P}; 

sorting n data points in said sampled set Wc in increasing order as xj<X2^ ... ^x n 
and constructing a set of disjoint intervals //, h such that data in sample W c falls 
15 within one of said disjoint intervals, wherein segmentation for constructing said set of 
disjoint intervals is specified via a vector of indices i(2), i(k)) and the / h interval 
is given by Ij=[xiQ-j)+i,Xj(f)] and length of Ij, denoted by Lj, is given by 

= x i(j) " x #o-i)+i I m d 

wherein the function for optimizing cost associated with said segmentation is 



20 c(s) = w£ + (l - u>) 



l * 



with w being a fixed weight between 0 and 1 and a 
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segmentation that minimizes c is defined by placing adjacent points jc/ and */+/ in the 
same segment if and only if x/+/-*/ < c/*, where d* = A(w/(l-w)). 



12. A computer-based method as per claim 11, wherein widths associated with said 
5 intervals are expanded to avoid additional sampling required to increase right end point to 
equal maximum value in Q c . 



13. A computer-based method as per claim 11, wherein size of said sampled set is 
approximated via the following iterative steps: 

10 (a) given a ^-segmentation, setting counters i=l and k=\ ; 

(b) selecting a sample size wherein n*(k)& ^ x ' p ^ — + — 5 wherein p is 

4f 2 

the probability that at least a fraction of points in Qc that lie outside the intervals is at 
most /; 

(c) obtaining a sample based on (b), computing algebraic constraints, and 
15 identifying a number k of bump intervals; and 

(d) if n>n*(k') or i = i m then utilizing sample size in (b); else setting counters 
k=k f and i=/+7, and returning to step (b). 

14. A computer-based method as per claim 1 1 , wherein A= x n -xj. 

20 
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15. A computer-based method as per claim 11, wherein A is estimated as a generic 
function f(a" , a? , a™, a"), wherein a, M and a™ are maximum and minimum values 
associated with column a h and a 2 and a™ are maximum and minimum values 
associated with column a 2 . 

5 

16. A computer-based method as per claim 1, wherein said method is implemented 
across networks. 

17. A computer-based method as per claim 16, wherein said across networks element 
10 comprises any of, or a combination of the following: local area network (LAN), wide 

area network (WAN), or the Internet. 

18. A computer-based method as per claim 1, wherein said step of constructing 
algebraic constraints further comprises the steps of: 

15 constructing a sample set W c of an induced set Qc, wherein P is a join predicate 

between tables R and S and Q c = {r.a, © r.a 2 : r e R} when the pairing rule P is a trivial 
rule 0 R and Q c = {r.a x © s.a 2 \r e R,se S.and (r,s) satisfies P}; 

sorting n data points in said sampled set W c in increasing order as jcy<x^ ••• 

and constructing a set of disjoint intervals //, such that data in sample W c falls 

20 within one of said disjoint intervals, wherein segmentation for constructing said set of 

disjoint intervals is specified via a vector of indices (i(l), i(2), .. ., i(k)) and the / h interval 
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is given by Ij=[Xio-!)+i,x i(/ )] and length of Ij, denoted by Lj, is given by 
Lj = x i{j) - x, 0 _ 1)+1 ; and said segments are determined by histogramming. 

19. A computer-based method as per claim 18, wherein said histogramming is done 
5 using 2h(n) buckets when h(n)=(2n) I/3 is the oversmoothing lower bound. 

20. A computer-based method as per claim 18, wherein widths associated with said 
intervals are expanded to avoid additional sampling required to increase right end point to 
equal maximum value in Qc- 

10 

21. A computer-based method as per claim 18, wherein size of said sampled set is 
approximated via the following iterative steps: 

(a) given a ^-segmentation, setting counters i=l and k=\; 

(b) selecting a sample size wherein n(k)& ~ + wherein p is 

15 the probability that at least a fraction of points in Qc that lie outside the intervals is at 
most /; 

(c) obtaining a sample based on (b), computing algebraic constraints, and 
identifying a number k of bump intervals; and 

(d) if n>n*(k') or i = i max , then utilizing sample size in (b); else setting counters 
20 k=k' and /=/+/, and returning to step (b). 
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22. An article of manufacture comprising a computer usable medium having 
computer readable program code embodied therein which implements a method for 
automatically finding and exploiting hidden, fuzzy algebraic constraints in a database, 

5 said method comprising the steps of: 

(a) computer readable program code constructing one or more candidates of form 
C=(ay, #2, P, ©X wherein aj and a 2 are numerical attributes associated with column 
values of data in said database, P is a pairing rule, and © is any of the following algebraic 
operators: +, -, x, or /; 

10 (b) computer readable program code constructing, for each candidate identified in 

(a), an algebraic constraint AC=(ay, a 2 , P, ©, //,..., /*) by applying any of, or a 
combination of the following techniques to a sample of column values: statistical 
histogramming, segmentation, or clustering, where //, /* is a set of disjoint intervals 
and k> 7, and 

15 wherein said constructed algebraic constraints are used in query optimization. 

23. An article of manufacture as per claim 22, wherein said medium further 
comprises: 

computer readable program code identifying a set of useful algebraic constraints 
20 via heuristics comprising a set of pruning rules; and 
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computer readable program code partitioning data into compliant data and 
exception data. 

24. An article of manufacture as per claim 23, wherein said medium further 
5 comprises: 

computer readable program code aiding in receiving a query; 
computer readable program code modifying said query to incorporate identified 
constraints; and 

computer readable program code combining results of modified query executed on 
10 data in said database and said original query executed on exception data. 

25. A computer-based method for optimizing a query executed on relational data in a 
database, said method comprising the steps of: 

(a) constructing one or more candidates of form C=(ay, a?, P 9 ©), wherein aj and 
15 d2 are numerical attributes associated with column values of data in said database, P is a 

pairing rule, and © is any of the following algebraic operators: +, x, or /; 

(b) constructing, for each candidate identified in (a), a fuzzy algebraic constraint 
AC=(tf/, d2, P, ©, //,..., h) by applying any of, or a combination of the following 
techniques to a sample of column values: statistical histogramming, segmentation, or 

20 clustering, where //,...,/* is a set of disjoint intervals and k > 1\ 
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(c) identifying a set of useful algebraic constraints via heuristics comprising a set 
of pruning rules; 

(d) partitioning data into compliant data and exception data; and 

(e) modifying said query to incorporate identified constraints, wherein an 
5 optimizer utilizes said identified constraints to identify new and efficient paths. 

26. A computer-based method as per claim 25, wherein said method further 
comprises the step of combining results of modified query executed on data in said 
database and said original query executed on exception data. 

10 

27. A computer-based method as per claim 25, wherein said partitioning is done by 
incrementally maintained materialized views, partial indices, or physical partitioning of 
the table. 

15 28. A computer-based method as per claim 25, wherein said pairing rule P represents 
either a trivial pairing rule 0 R or a join between tables R and S and said pruning rules 
comprise any of, or a combination of the following: 

pairing rule P is of form R.a = S.b or of the form and the number of rows in 
either table R or table S lies below a specified threshold value; 
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pairing rule P is of form R.a = S.b with a e K and the number of distinct values 
in S.b divided by the number of values in R.a lies below a specified threshold value, 
wherein K is a set comprising key-like columns among all columns in said database; 

pairing rule P is of form R.a = S.b, and one or both of R and S fails to have an 
5 index on any of its columns; or 

pairing rule P is of form R.a = S.b with a g K , and 5.6 is a system-generated key. 

29. A computer-based method as per claim 25, wherein said step of constructing one 
or more candidates further comprises the steps of: 

10 generating a set P of pairing rules; and 

for each pairing rule Pep, systematically considering possible attribute pairs (aj, 
a2) and operators © with which to construct candidates. 

30. A computer-based method as per claim 29, wherein said step of generating a set P 

15 of pairing rules further comprises the steps of: 
initializing p to be an empty set; 

adding a trivial pairing rule of the form o R to said set P for each table R in said 
database; and 

generating and adding nontrivial pairing rules to said set P based upon identifying 

20 matching columns via an inclusion dependency, wherein a column b is considered a 
match for column a if: 
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data in columns a and b are of a comparable type; or 

either (i) column a is a declared primary key and column b is a declared 

foreign key for the primary key, or (ii) every data value in a sample from column b 

has a matching value in column a. 

5 

31. A computer-based method as per claim 29, wherein said step of generating a set P of 

pairing rules further comprises the steps of: 
initializing P to be an empty set; 

adding a trivial pairing rule of the form or to said set P for each table R in said 
10 database; and 

generating a set K of key-like columns from among all columns in said database 
with each column in set K belonging to a predefined set of types T, said set K comprising 
declared primary key columns, declared unique key columns, and undeclared key 
columns, wherein said primary keys or declared unique keys are compound keys of form 

15 a = (a 1 ,... ) a m )eT l form>l\ 

adding nontrivial pairing rules to said set P based upon identifying matching 
compound columns via an inclusion dependency wherein, given a compound key 
(aj^.^a^eK, a compound column b is considered a component wise match for 
compound column a if: 

20 data in compound columns a and b are of a comparable type; or 
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either (i) compound column a is a declared primary key and compound 
column b is a declared foreign key for the primary key, or (ii) every data value in a 
sample from compound column b has a matching value in compound column a. 

5 32. A computer-based method as per claim 25, wherein said method is implemented 
across networks. 

33. A computer-based method as per claim 32, wherein said across networks element 
comprises any of, or a combination of the following: local area network (LAN), wide 

10 area network (WAN), or the Internet. 

34. A method to optimize query processing in a database comprising the steps of: 

(a) identifying candidates of form C= (a }) a2, P, ®) by finding declared or 
undeclared key columns and columns related to said declared and undeclared key 

15 columns via an inclusion dependency, wherein aj and a 2 are numerical attributes 

associated with column values of data in said database, P is a pairing rule, and © 
is an algebraic operator; 

(b) for each candidate in (a), identifying a sample set and constructing an 
algebraic constraint AC=(ay, a 2 , P, 0, /;,..., /*) for said sample set by applying 

20 any of, or a combination of the following techniques: statistical histogramming, 

segmentation, or clustering techniques, wherein the sample size is selected to 
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control the number of exception records that fail to satisfy said algebraic 
constraint; 

(c) identifying a set of useful constraints and associated exception tables via 
heuristics comprising pruning rules and creating exception tables to hold said 

5 exception records; and 

(d) during query processing, modifying queries to incorporate said identified 
algebraic constraints with an optimizer utilizing said identified algebraic 
constraints and said created exception tables to accelerate query processing. 

10 35. A query optimization method based upon discovering undeclared and fuzzy rules 
in a database, said method comprising the steps of: 

(a) identifying a set of keys for one or more tables in said database; 

(b) identifying a set of inclusion dependencies covered by said identified keys; 

(c) identifying sets of column pairs in said tables linked by an inclusion 
15 dependency, 

(d) sampling data from columns in each column pair identified in (c), and 

(e) using a data mining algorithm to identify significant patterns between data in 
said columns and utilizing said patterns to derive one or more rules, wherein query 
optimization is attained by modifying a query to incorporate constraints defined by said 

20 derived one or more rules. 
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36. A query optimization method as per claim 35, wherein said method further 
comprises the step of partitioning data into compliant data and exception data based upon 
said identified patterns and said derived one or more rules. 

5 37. A query optimization method as per claim 36, wherein said partitioning is done by 
incrementally maintained materialized views, partial indices, or physical partitioning of 
the table. 

38. A query optimization method as per claim 36, wherein said step of modifying 
10 query to incorporate constraints further comprises the steps of: 
receiving a query; 

modifying said query to incorporate identified constraints; and 
combining results of modified query executed on data in said database and said 
original query executed on exception data. 

15 
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